HomeLLC assignment for data scientist position

Goal

  • Describe all the variables that will influence residential home prices in the United States over the next 10 years.

Steps

  • Get data and preprocess
  • EDA
  • Conclusion

Get data

The data has been collected from different sources for analysis are listed below.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sbn
import plotly_express as px
import us
import plotly.offline as py
import plotly.graph_objects as go
import us
import plotly.io as pio

py.init_notebook_mode()

Get house price data

This has been collected from zillow. It containes the ZHVI($, middle tier houses) with region name from year 1996 to 2020.

Zillow Home Value Index (ZHVI): A smoothed, seasonally adjusted measure of the typical home value and market changes across a given region and housing type. It reflects the typical value for homes in the 35th to 65th percentile range.

In [3]:
data = pd.read_csv("Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_raw_mon.csv")
data.head()
Out[3]:
RegionID SizeRank RegionName RegionType StateName 1996-01-31 1996-02-29 1996-03-31 1996-04-30 1996-05-31 ... 2020-01-31 2020-02-29 2020-03-31 2020-04-30 2020-05-31 2020-06-30 2020-07-31 2020-08-31 2020-09-30 2020-10-31
0 102001 0 United States Country NaN 107821.867678 107654.603009 107713.705054 107918.942733 108164.619841 ... 249203.331433 250643.259543 252629.014281 254150.724751 254549.565967 255614.305037 257585.290696 259831.677123 262360.574275 265284.724765
1 394913 1 New York, NY Msa NY 189934.272735 188204.824639 187816.624546 187498.218465 187389.402039 ... 483951.088045 485839.712243 488194.165391 491315.870256 491982.576068 492622.948786 496247.290834 499415.566341 505450.065658 511262.557585
2 753899 2 Los Angeles-Long Beach-Anaheim, CA Msa CA 187347.729377 186677.475492 186385.069989 186361.343962 186230.020967 ... 685244.538647 695398.255181 703661.731518 704376.924417 691466.082540 695103.278425 703758.462391 713145.770130 721717.599064 734809.502703
3 394463 3 Chicago, IL Msa IL 162897.743380 162896.882478 161878.897187 163450.517371 164035.887900 ... 244497.884662 245640.732803 247185.110403 247995.539082 248650.973753 249453.850439 251404.948874 252758.291806 256131.418453 258465.669236
4 394514 4 Dallas-Fort Worth, TX Msa TX 114348.478842 114263.285984 114586.607256 115120.218038 115361.847995 ... 260128.871786 261636.782641 263331.041996 264196.814157 264868.729373 266565.066854 267599.144696 269985.953332 272021.390173 274991.169069

5 rows × 303 columns

In [4]:
# extract city from region name
data.drop(0,axis=0,inplace=True)
data['city'] = data.RegionName.apply(lambda x: x.split(',')[0])
In [5]:
dates = data.columns.tolist()[5:-1]
In [6]:
df = data[['StateName']+dates].melt(id_vars=["StateName"], 
        var_name="date", 
        value_name="price")
df.rename(columns={'StateName':'state'},inplace=True)
df.head()
Out[6]:
state date price
0 NY 1996-01-31 189934.272735
1 CA 1996-01-31 187347.729377
2 IL 1996-01-31 162897.743380
3 TX 1996-01-31 114348.478842
4 PA 1996-01-31 121820.720117
In [7]:
df.price.fillna(0,inplace=True)
df.date = pd.to_datetime(df.date)
df['year'] = df.date.dt.year
df.isna().sum()
Out[7]:
state    0
date     0
price    0
year     0
dtype: int64
In [8]:
px.line(df[['year','price']].groupby('year').mean().reset_index(), 
        x='year', 
        y='price',
        title='Average house price from year 1996 to 2020')
  • Average house price is increasing with time.
In [9]:
states = df[['state','date','price']].groupby(['state']).mean().reset_index().sort_values(by='price',ascending=False)[:5]['state'].values.tolist()
px.line(df[df['state'].isin(states)][['state','date','price']].groupby(['state','date']).median().reset_index(), 
        x='date', 
        y='price',
        color='state',
    title='Top 5 expensive state\'s House value index($) from year 1998 to 2020 by state')
  • CA, CT, DC, HI and MA are the states with most expensive houses.
  • Prices are increasing with time.
  • Between year 2006 to 2008 there was hike in the house prices

Get state wise GDP

Here let's take some key factore which can imapact the housing prices in usa. One of the key factore is economy which is generally measured by GDP, employment rate etc.

Let's first get the data for GDP. This data is collected from https://apps.bea.gov/. which containes the percentage of change in gdp from preceding year. The time period of the data is from year 1997 to 2019

In [10]:
gdp = pd.read_excel('gdp_by_state.xls',skiprows=5,skipfooter=3)
gdp.drop(0,axis=0,inplace=True)
gdp.drop('GeoFips',axis=1,inplace=True)
In [11]:
gdp.head()
Out[11]:
GeoName 1997-1998 1998-1999 1999-2000 2000-2001 2001-2002 2002-2003 2003-2004 2004-2005 2005-2006 ... 2009-2010 2010-2011 2011-2012 2012-2013 2013-2014 2014-2015 2015-2016 2016-2017 2017-2018 2018-2019
1 Alabama 3.6 3.7 1.6 -0.4 2.6 2.5 6.5 3.4 1.9 ... 2.3 1.4 0.5 1.2 -0.7 1.0 1.1 1.1 2.3 1.4
2 Alaska -2.6 -1.1 -3.0 3.7 4.6 -1.8 3.8 3.0 7.7 ... -3.0 0.9 5.5 -5.1 -2.3 1.0 -1.3 -0.9 0.2 0.6
3 Arizona 8.7 8.2 4.9 2.3 3.2 6.4 4.2 8.2 5.4 ... 1.0 2.2 1.9 0.7 1.5 2.9 3.3 3.7 3.9 3.1
4 Arkansas 2.5 5.4 1.0 -0.2 3.2 4.0 4.6 3.5 2.2 ... 2.6 2.0 0.4 2.4 1.3 1.1 0.5 1.3 1.5 0.6
5 California 6.6 7.7 8.0 -0.1 2.4 4.7 4.0 4.5 4.2 ... 1.5 1.6 2.5 3.5 4.4 5.2 3.4 4.3 3.1 3.4

5 rows × 23 columns

In [12]:
year = gdp.columns[2:]
df_gdp = gdp.melt(id_vars=["GeoName"], 
        var_name="year", 
        value_name="gdp")
df_gdp.year = df_gdp.year.apply(lambda x: int(x.split('-')[1]))
df_gdp.GeoName = df_gdp.GeoName.apply(lambda x: us.states.lookup(x).abbr if us.states.lookup(x) is not None else x)
In [13]:
df_gdp.rename(columns ={'GeoName':'state'},inplace=True)
df_gdp.head()
Out[13]:
state year gdp
0 AL 1998 3.6
1 AK 1998 -2.6
2 AZ 1998 8.7
3 AR 1998 2.5
4 CA 1998 6.6
In [14]:
states = df_gdp[['state','year','gdp']].groupby(['state']).mean().reset_index().sort_values(by='gdp',ascending=False)[:5]['state'].values.tolist()
px.line(df_gdp[df_gdp['state'].isin(states)][['state','year','gdp']].groupby(['state','year']).median().reset_index(), 
        x='year', 
        y='gdp',
        color='state',
    title='Top 5 states with highest percentage change in gdp from year 1998 to 2019 by state')
  • CA, ID, ND, TX and UT are amonge the states with highest percente change in gdp.
  • ND had the +22.3% gdp change from preceding year in 2012 and -7.1% gdp change in 2016.
In [15]:
df = df.merge(df_gdp,left_on=['state','year'], right_on=['state','year'])

Get state wise employment percentage

As mentioned earlyer the another key factore of economy is employment rate. This data containes the state wise employment percentage of the total US employement. This data is also from year 1997 to 2019.

In [16]:
emp = pd.read_excel('Employment_by_state.xls',skiprows=5,skipfooter=5)
col = ['GeoName','1997','1998','1999','2000','2001','2002','2003','2004','2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015','2016','2017','2018','2019']
emp.drop(0,inplace=True)
emp = emp[col]
emp.head()
Out[16]:
GeoName 1997 1998 1999 2000 2001 2002 2003 2004 2005 ... 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
1 Alabama 1.5 1.5 1.5 1.4 1.4 1.4 1.4 1.4 1.4 ... 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.3 1.3 1.3
2 Alaska * 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 ... 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 0.2 0.2
3 Arizona 1.6 1.7 1.7 1.7 1.7 1.7 1.8 1.8 1.9 ... 1.8 1.8 1.8 1.8 1.9 1.9 1.9 1.9 1.9 1.9
4 Arkansas 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 ... 0.9 0.9 0.9 0.9 0.9 0.8 0.8 0.8 0.8 0.8
5 California 11.4 11.6 11.7 11.6 11.7 11.7 11.7 11.7 11.7 ... 11.4 11.3 11.5 11.7 11.8 11.9 12.0 12.0 12.0 12.1

5 rows × 24 columns

In [17]:
emp = emp.melt(id_vars=["GeoName"], 
        var_name="year", 
        value_name="emp_perc")
emp.rename(columns={'GeoName':'state'},inplace=True)
emp.state = emp.state.apply(lambda x: us.states.lookup(x).abbr if us.states.lookup(x) is not None else x)
emp.year = emp.year.astype('int')
emp.emp_perc = emp.emp_perc.astype('float')
In [18]:
states = emp[['state','year','emp_perc']].groupby(['state']).mean().reset_index().sort_values(by='emp_perc',ascending=False)[:5]['state'].values.tolist()
px.line(emp[emp['state'].isin(states)][['state','year','emp_perc']].groupby(['state','year']).median().reset_index(), 
        x='year', 
        y='emp_perc',
        color='state',
    title='Top 5 states with highest percentage of employement of total employment from year 1998 to 2019 by state')
  • South east, mid east, far west, great lakes and CA has the highest contribution to employment of us.
In [19]:
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],[0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]

tmp = emp[emp['year']>2010].groupby('state')['emp_perc'].mean().reset_index()

tmp = tmp.dropna(subset=['state','emp_perc'], how='any')
data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = tmp['state'], 
        z = tmp['emp_perc'].astype(float), 
        locationmode = 'USA-states', 
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "Employment percentage")
        ) ]

layout = dict(
        title = 'Average Employment percentage in different states after 2010',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )

fig = dict( data=data, layout=layout )
py.iplot(fig)
  • CA, TX, NY, FL etc are the states with highest employment contribution.
In [20]:
df = df.merge(emp,left_on=['state','year'], right_on=['state','year'])

Get morgage interest rate

Interest rate also has major impact on house prices, as most people consider buying house with a morgage. Changes in interest rates can greatly influence a person's ability to purchase a residential property. This data is from year 1971 to 2020. It containes 30 year fixed morgage interest rate average.

In [21]:
interest = pd.read_csv("MORTGAGE30US_intrest_rate.csv")
interest.DATE = pd.to_datetime(interest.DATE)
interest.head()
Out[21]:
DATE MORTGAGE30US
0 1971-04-02 7.33
1 1971-04-09 7.31
2 1971-04-16 7.31
3 1971-04-23 7.31
4 1971-04-30 7.29
In [22]:
interest = interest.groupby(interest.DATE.dt.year).mean().reset_index()
interest.columns = ['year','interest_rate']
In [23]:
px.line(interest,x='year',y='interest_rate',title='Morgage interest rate in percentage from year 1971 to 2020')
  • Interest rate was very high during 80's
  • Moragage interest rate has been decreasing with time.
In [24]:
df = df.merge(interest,left_on='year',right_on='year')

Get state wise population percentage

Demographic of the state and contry can also impact the house price. Here population, personal income are considered for analysis. The population data containes the state wise population percentage of total population

In [25]:
population = pd.read_excel('population_by_state.xls',skiprows=5,skipfooter=6)
population = population[col]
population.drop(0,axis=0,inplace=True)
population.head()
Out[25]:
GeoName 1997 1998 1999 2000 2001 2002 2003 2004 2005 ... 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
1 Alabama 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.5 1.5 ... 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5
2 Alaska * 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 ... 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
3 Arizona 1.7 1.8 1.8 1.8 1.9 1.9 1.9 1.9 2.0 ... 2.1 2.1 2.1 2.1 2.1 2.1 2.1 2.2 2.2 2.2
4 Arkansas 1.0 1.0 1.0 0.9 0.9 0.9 0.9 0.9 0.9 ... 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9
5 California 11.9 12.0 12.0 12.0 12.1 12.1 12.2 12.1 12.1 ... 12.1 12.1 12.1 12.1 12.1 12.1 12.1 12.1 12.1 12.0

5 rows × 24 columns

In [26]:
population = population.melt(id_vars=["GeoName"], 
        var_name="year", 
        value_name="population_perc")
In [27]:
population.rename(columns={'GeoName':'state'},inplace=True)
population.state = population.state.apply(lambda x:x.split(" ")[0])
population.year = population.year.astype('int')
population.population_perc = population.population_perc.astype('float')
population.state = population.state.apply(lambda x: us.states.lookup(x).abbr if us.states.lookup(x) is not None else x)
In [28]:
population.head()
Out[28]:
state year population_perc
0 AL 1997 1.6
1 AK 1997 0.2
2 AZ 1997 1.7
3 AR 1997 1.0
4 CA 1997 11.9
In [29]:
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],[0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]

tmp = population.groupby('state')['population_perc'].mean().reset_index()

tmp = tmp.dropna(subset=['state','population_perc'], how='any')
data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = tmp['state'], 
        z = tmp['population_perc'].astype(float), 
        locationmode = 'USA-states', 
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "Population percentage")
        ) ]

layout = dict(
        title = 'Average population percentage in different states',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )

fig = dict( data=data, layout=layout )
py.iplot(fig)
  • CA, TX, NY, FL have higher percentage of population
In [30]:
df = df.merge(population,left_on=['state','year'], right_on=['state','year'])
In [31]:
df.gdp = df.gdp.astype('float')
In [32]:
df['month'] = df.date.dt.month

Get state wise house price index

This data is used to how each state's house prices changes with time.

House Price Index(HPI): A house price index measures the price changes of residential housing as a percentage change from some specific start date.

In [33]:
hpi = pd.read_csv('House_price_by _state.csv')

col = ['Date','AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL',
       'GA', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
       'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV',
       'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA',
       'VT', 'WA', 'WI', 'WV', 'WY']

hpi = hpi.melt(id_vars=["Date"], 
        var_name="state", 
        value_name="hpi")

hpi.Date = pd.to_datetime(hpi.Date)
hpi.hpi = pd.to_numeric(hpi.hpi)

hpi['year'] = hpi.Date.dt.year
hpi['month'] = hpi.Date.dt.month

hpi.drop(columns='Date',inplace=True)
hpi.head()
Out[33]:
state hpi year month
0 AK 180.140002 2017 12
1 AK 180.191092 2017 11
2 AK 180.642452 2017 10
3 AK 180.863076 2017 9
4 AK 180.767705 2017 8
In [34]:
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],[0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]

tmp = hpi.groupby('state')['hpi'].mean().reset_index()

tmp = tmp.dropna(subset=['state','hpi'], how='any')
data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = tmp['state'], 
        z = tmp['hpi'].astype(float), 
        locationmode = 'USA-states', 
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "House price index")
        ) ]

layout = dict(
        title = 'Average hpi in different states',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )

fig = dict( data=data, layout=layout )
py.iplot(fig)
  • HI, ND, WY, MD, FL etc have the higher avrage of house price index.
In [35]:
df = df.merge(hpi, left_on=['state','year','month'], right_on=['state','year','month'])

Get income in us

The average personal income of us can also imapact the buyer's capability of buying the house. This data containes total monthly US personal income from year 1959 to 2020 in billion dollar.

In [36]:
income = pd.read_excel('personel income.xls',skiprows=10)

income.observation_date = pd.to_datetime(income.observation_date)
income['year'] = income.observation_date.dt.year
income['month'] = income.observation_date.dt.month

income.drop('observation_date',axis=1,inplace=True)
income.head()
Out[36]:
PI year month
0 391.8 1959 1
1 393.7 1959 2
2 396.5 1959 3
3 399.9 1959 4
4 402.4 1959 5
In [37]:
px.line(income.groupby('year').mean().reset_index(),x='year',y='PI', title='Average of US personal income in billion dollar from year 1959 to 2020')
  • Average personal income has doubled in last 15 years.
In [38]:
df = df.merge(income,left_on=['year','month'],right_on=['year','month'])
In [39]:
df.head()
Out[39]:
state date price year gdp emp_perc interest_rate population_perc month hpi PI
0 CA 1998-01-31 196065.034409 1998 6.6 11.6 6.942642 12.0 1 71.03246 7387.7
1 CA 1998-01-31 277664.673249 1998 6.6 11.6 6.942642 12.0 1 71.03246 7387.7
2 CA 1998-01-31 123916.738743 1998 6.6 11.6 6.942642 12.0 1 71.03246 7387.7
3 CA 1998-01-31 190100.308896 1998 6.6 11.6 6.942642 12.0 1 71.03246 7387.7
4 CA 1998-01-31 153469.198238 1998 6.6 11.6 6.942642 12.0 1 71.03246 7387.7
In [40]:
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],[0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]

tmp = df.groupby('state')['gdp'].mean().reset_index()

tmp = tmp.dropna(subset=['state','gdp'], how='any')
data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = tmp['state'], 
        z = tmp['gdp'].astype(float), 
        locationmode = 'USA-states', 
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "GDP")
        ) ]

layout = dict(
        title = 'Average percent of change of GDP in different states',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )

fig = dict( data=data, layout=layout )
py.iplot(fig)
  • ND, CA, UT, TX etc has higher percentage of change in gdp
In [41]:
px.line(df.drop_duplicates(subset=['month','year']), 
        x='date',
        y='hpi',
        title='House price index from year 1998 to 2020')
  • HPI is increasing with time.
In [42]:
px.line(df, 
        x='year', 
        y='population_perc',
        color='state',
        title='Population percentage from year 1998 to 2020')
  • All the states have almost consistent amount of population since 1998.
In [43]:
px.line(df, 
        x='year', 
        y='emp_perc',
        color='state',
        title='Employment percentage of total employment from year 1998 to 2020 by state',
        labels={'emp_perc':'Employment percentage','year':'Year'})
  • CA, TX, NY and FL have the highest employment percentage contribution.
  • All the states have consistenly maintained the employment rate since 1998
In [44]:
corr = df.corr()

plt.figure(figsize=(10,10))
sbn.heatmap(corr,annot=True,cmap='coolwarm')
plt.title('Feature correlation')
plt.show()

From above heatmap we can conclude

  • House price is increasing with increase in year, employment rate, population, house price index and Personal income.
  • House price is decreasing as interest rate and gdp increase.
  • Interest rate is decreasing with year.
In [45]:
state_data = pd.read_csv('State_time_series.csv')
state_data.drop(columns=['DaysOnZillow_AllHomes',
 'InventorySeasonallyAdjusted_AllHomes',
 'InventoryRaw_AllHomes', 'MedianPriceCutDollar_AllHomes',
 'MedianPriceCutDollar_CondoCoop',
 'MedianPriceCutDollar_SingleFamilyResidence','PctOfListingsWithPriceReductionsSeasAdj_AllHomes',
 'PctOfListingsWithPriceReductionsSeasAdj_CondoCoop',
 'PctOfListingsWithPriceReductionsSeasAdj_SingleFamilyResidence','Sale_Counts_Seas_Adj'],inplace=True)
In [46]:
state_data.Date = pd.to_datetime(state_data.Date)
state_data['year'] = state_data.Date.dt.year

Zillow Home Value Index (ZHVI): A smoothed, seasonally adjusted measure of the typical home value and market changes across a given region and housing type. It reflects the typical value for homes in the 35th to 65th percentile range

In [47]:
var = ['ZHVI_1bedroom',
 'ZHVI_2bedroom',
 'ZHVI_3bedroom',
 'ZHVI_4bedroom',
 'ZHVI_5BedroomOrMore',]
tmp = state_data.groupby('year')[var].mean().reset_index()

fig = go.Figure()
for i in var:
    fig.add_trace(go.Bar(x =tmp['year'], y=tmp[i], name=i.split('ZHVI_')[1]))
fig.update_layout(title='House value index for different year by no of rooms', xaxis_title='year',yaxis_title='ZHIV value')
fig.show()
  • House Value Index depends on no of bedroom
  • As number of bedroom increases the house price increases accordingly
  • House price is dependent on number of bedrooms
In [48]:
tmp = state_data.groupby('RegionName')['ZHVIPerSqft_AllHomes'].mean().reset_index(name = "ZHVIpersqft_mean")

tmp.RegionName = tmp.RegionName.apply(lambda x: us.states.lookup(x).abbr if us.states.lookup(x) is not None else None )
tmp = tmp.dropna(subset=['ZHVIpersqft_mean','RegionName'], how='any')
In [49]:
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],[0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]


data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = tmp['RegionName'], 
        z = tmp['ZHVIpersqft_mean'].astype(float), 
        locationmode = 'USA-states', 
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "Home value per square foot")
        ) ]

layout = dict(
        title = 'Average house value per square foot in different states',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )

fig = dict( data=data, layout=layout )
py.iplot(fig)
  • HI and CA have the highest house value per square foot.
In [50]:
px.bar(state_data.groupby('year')['ZHVIPerSqft_AllHomes'].mean().reset_index(),
       x='year',
       y='ZHVIPerSqft_AllHomes',
       title='Average house value per square foot in different year',
       labels={'ZHVIPerSqft_AllHomes':'House value index per sqft'})
In [51]:
fig = go.Figure()

var = ['PctOfHomesIncreasingInValues_AllHomes','PctOfHomesDecreasingInValues_AllHomes']
tmp = state_data.groupby('year')[var].median().dropna().reset_index()
for i in var:
    fig.add_trace(go.Scatter(x=tmp['year'], y=tmp[i],
                        mode='lines',
                        name=i))
fig.update_layout(title ='Percentage of house price increasing and decreasing in values', xaxis_title='year',yaxis_title='Percentage of house price')
fig.show()
  • Both the line alters between year 2008 and 2011
  • In year 2008 to 2011 only 20% to 21% of the house had increase in the value.
In [52]:
fig = go.Figure()

var = ['MedianRentalPrice_1Bedroom',
 'MedianRentalPrice_2Bedroom',
 'MedianRentalPrice_3Bedroom',
 'MedianRentalPrice_4Bedroom',
 'MedianRentalPrice_5BedroomOrMore',
 'MedianRentalPrice_CondoCoop',
 'MedianRentalPrice_DuplexTriplex',
 'MedianRentalPrice_MultiFamilyResidence5PlusUnits',
 'MedianRentalPrice_SingleFamilyResidence']

tmp = state_data.groupby('year')[var].mean().dropna().reset_index()

for i in var:
    fig.add_trace(go.Scatter(x=tmp['year'], y=tmp[i],
                        mode='lines',
                        name=i.split('MedianRentalPrice_')[1]))
fig.update_layout(title ='Median rental price per sqft', xaxis_title='year',yaxis_title='Price per sqft')
fig.show()
  • Rental price per sqft increases as no of bedroom increases

Conclusion:

  • States like CA, CO, FL, MA, UT etc have the highest house price throught out the time, So here the state is one of the variable to imapact the house price
  • House prices are increasing and morgage interest rates are decreasing with time, So another variable is morgage interest rate.
  • A house price index measures the price changes of residential housing as a percentage change from some specific start date, here House price index is increasing with time.
  • States like CA, TX, NT and FL have the highest employment rate, which indicate the higher migration rate, hence high demand of houses which can boost the house prices in this states.
  • Personal income in united states has been increased to double(from 8k to 16k) which also increases the buyer capability to buy a house.
  • House prices have positive correlation with time, employment rate, population, HPI and personal income, which means it is increasing with this variables
  • House price have negative correlation with interest rate and GDP, which means it decreases with this variables.
  • Morgage interest rate has been decreased with time, which contributed in the hike of the house prices.
  • House price is dependent on no of bedrooms and size of the house.

Key variables that will inpact the house prices are

  1. Population
  2. Interest rate
  3. Employment rate
  4. Personal income
  5. GDP

Some other variable which impacts the house price will be

  1. Number of bedrooms in the house
  2. Size of the house
  3. Locality
  4. Age of the house
  5. Construction quality
  6. Funishing
  7. Rental price per square foot etc.

References